Introduction to Open Data Science. Happy to learn to connect Git and R even though the content of data analysis is familiar to me. I’ve used R MarkDown before and found it useful. In addition I’ve used Git before on Tietokantojen Perusteet course. However, it’s been a long time so recalling things is needed. Here’s the link to my IODS GitHub repository.
Today is the following day.
## [1] "Thu Dec 3 12:01:14 2020"
I wanted to remind me what I’ve done last time with R MarkDown. I found a nice data on exchange and forward rates. I make a table and a graph of Sterling/EUR exchange rate. I uploaded forward2.dat to git repository and I call my data set from there in order to plot the rate. I hide the R code to make the outcome more readable.
## Warning: `as_data_frame()` is deprecated as of tibble 2.0.0.
## Please use `as_tibble()` instead.
## The signature and semantics have changed, see `?as_tibble`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
| EXUSBP | EXUSEUR | EXEURBP | F1USBP | F1USEUR | F1EURBP | F3USBP | F3USEUR | F3EURBP | |
|---|---|---|---|---|---|---|---|---|---|
| Min. :1.073 | Min. :0.5827 | Min. :0.3567 | Min. :1.067 | Min. :0.5873 | Min. :0.3588 | Min. :1.061 | Min. :0.5941 | Min. :0.3624 | |
| 1st Qu.:1.507 | 1st Qu.:0.8876 | 1st Qu.:0.4855 | 1st Qu.:1.504 | 1st Qu.:0.8885 | 1st Qu.:0.4845 | 1st Qu.:1.501 | 1st Qu.:0.8926 | 1st Qu.:0.4841 | |
| Median :1.617 | Median :1.0738 | Median :0.5598 | Median :1.616 | Median :1.0774 | Median :0.5589 | Median :1.609 | Median :1.0855 | Median :0.5588 | |
| Mean :1.665 | Mean :1.0416 | Mean :0.6213 | Mean :1.663 | Mean :1.0447 | Mean :0.6203 | Mean :1.658 | Mean :1.0503 | Mean :0.6184 | |
| 3rd Qu.:1.756 | 3rd Qu.:1.1741 | 3rd Qu.:0.7107 | 3rd Qu.:1.753 | 3rd Qu.:1.1778 | 3rd Qu.:0.7091 | 3rd Qu.:1.748 | 3rd Qu.:1.1819 | 3rd Qu.:0.7051 | |
| Max. :2.443 | Max. :1.4222 | Max. :1.6002 | Max. :2.441 | Max. :1.4240 | Max. :1.5954 | Max. :2.433 | Max. :1.4278 | Max. :1.5863 |
I’m happy to see that R MarkDown is linked to LaTeX syntax! Here’s a maximization problem from my current research on the optimal mechanism design with enforcement: \[\begin{align} \max_{r(\cdot),t(\cdot),m(\cdot)} \mathbb{E} \left[ t(\theta) - K m(\theta) \right]\label{OBJ} \tag{MAX} \end{align}\] subject to \[\begin{align} &t(\theta) = \theta r(\theta) - V(\underline{\theta}) - \int_{\underline{\theta}}^{\theta} \mathcal{I}(s|s)ds \label{TAX} \tag{TAX} \\ &\mathcal{I}(\theta|\theta) \qquad \text{is nondecreasing} \label{IC} \tag{IC} \\ &\mathcal{I}(\theta|\theta) \geq 0 \label{IR} \tag{IR} \end{align}\] for all \(\theta \in \Theta\) with \(\mathcal{I}(\theta|\theta):=r(\theta) - m(\theta) \varphi\).
Today is
## [1] "Thu Dec 3 12:01:15 2020"
In this week we learn how to create data and analyse it. We read the data from the given URL. The meta descriptions can be found from here.
The dataset queries the relationship between learning approaches and students’ achievements in an introductory statistics course in Finland. The original data have 183 responders (observations) for 97 questions (variables). For our purposes, we subset the dataset to contain variables gender, age, attitude, deep, stra, surf and points. The variables are either integers or numerics. The meta-file gives the following description for each variable:
Next we omit the observations where the exam points variable is zero. After that we scale variables by the mean – that is, we divide each combination variable by its number of questions asked. So we have 166 observations for 7 variables left.
Let us next read the data we created earlier in the data wranling part:
# Read the data
learning2014 <- read.csv("/Users/teemupekkarinen/IODS-project/data/learning2014",row.names = 1)
To summarize our dataset’s content we run
# Structure
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ Gender : int 2 1 2 1 1 2 1 2 1 2 ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ Deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ Stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ Surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
# Dimension
dim(learning2014)
## [1] 166 7
That is, we have the information about gender, age, attitude, and three different learning approaches (deep, strategic, and surface) of 166 responders. All of the variables are numeric and the combination variables (attitude, deep, stra, and surf) are means. For more detailed description we run:
# Summary
summary(learning2014)
## Gender Age Attitude Deep
## Min. :1.000 Min. :17.00 Min. :1.400 Min. :1.583
## 1st Qu.:1.000 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333
## Median :2.000 Median :22.00 Median :3.200 Median :3.667
## Mean :1.663 Mean :25.51 Mean :3.143 Mean :3.680
## 3rd Qu.:2.000 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083
## Max. :2.000 Max. :55.00 Max. :5.000 Max. :4.917
## Stra Surf Points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
We observe that there are almost twice more female responders than male responders: 56 men and 110 women. The average age of the responders is 26 and the youngest person is 17 years old and the oldest 55 years old. The age distribution is wide but skewed towards young people. As summarize we can say than a typical responder is a young female.
# Histrograms
Gender <- learning2014$Gender
hist(Gender,breaks=2)
sum(Gender==1)
## [1] 56
Age <- learning2014$Age
hist(Age)
mean(Age)
## [1] 25.51205
var(Age)
## [1] 60.31198
The average exam points is 28 with maximum 33 and minimum 7. The exam results are more or less normally distributed among all participants.
# Histrograms
Points <- learning2014$Points
hist(Points)
However, when we compare the points between men and female, we observe that the mean of exam points for men is slightly greater than for women. There are even more men with score 30 or 31 than women.
# Package
library(ggplot2)
# Histrograms
PointsMale <- learning2014$Points[Gender==1]
mean(PointsMale)
## [1] 23.48214
PointsFemale <- learning2014$Points[Gender==2]
mean(PointsFemale)
## [1] 22.32727
data <- data.frame(
type = c( rep("Male", 56), rep("Female", 110) ),
value = c(PointsMale, PointsFemale)
)
ggplot(data, aes(x=value, fill=type)) +
geom_histogram(binwidth=.5, alpha=.5, position="identity")
The mean of attitude, 3.14, is almost neutral 3. That is, on average responders have a bit positive attitude towards statistics. Also attitude is nearly normally distributed. Male responders have more positive attitude than female.
# Histrograms
Attitude <- learning2014$Attitude
hist(Attitude)
AttitudeMale <- learning2014$Attitude[Gender==1]
mean(PointsMale)
## [1] 23.48214
AttitudeFemale <- learning2014$Attitude[Gender==2]
mean(PointsFemale)
## [1] 22.32727
data <- data.frame(
type = c( rep("Male", 56), rep("Female", 110) ),
value = c(AttitudeMale, AttitudeFemale)
)
ggplot(data, aes(x=value, fill=type)) +
geom_histogram(binwidth=.5, alpha=.5, position="identity")
The minority prefers surface learning approach to learning in the statistics course, whereas the majority prefers the methods of deep learning. Moreover, the greatest variation in learning methods is in strategic learning. Deep and surface learning is almost identically distributed expect deep learning has a greater mean.
# Package
library(ggplot2)
# Histrograms
Deep <- learning2014$Deep
Stra <- learning2014$Stra
Surf <- learning2014$Surf
data <- data.frame(
type = c( rep("Deep", 166), rep("Stra", 166), rep("Surf", 166) ),
value = c(Deep, Stra, Surf)
)
ggplot(data, aes(x=value, fill=type)) +
geom_histogram(binwidth=.5, alpha=.5, position="identity")
There is no much difference between genders in learning methods; the distributions between male and female responders are almost indentically distributed.
# Histrograms
# Strategic Learning
StraMale <- learning2014$Stra[Gender==1]
StraFemale <- learning2014$Stra[Gender==2]
data <- data.frame(
type = c( rep("Male", 56), rep("Female", 110) ),
value = c(StraMale, StraFemale)
)
ggplot(data, aes(x=value, fill=type)) +
geom_histogram(binwidth=.5, alpha=.5, position="identity")
# Deep Learning
DeepMale <- learning2014$Deep[Gender==1]
DeepFemale <- learning2014$Deep[Gender==2]
data <- data.frame(
type = c( rep("Male", 56), rep("Female", 110) ),
value = c(DeepMale, DeepFemale)
)
ggplot(data, aes(x=value, fill=type)) +
geom_histogram(binwidth=.5, alpha=.5, position="identity")
# Surface Learning
SurfMale <- learning2014$Surf[Gender==1]
SurfFemale <- learning2014$Surf[Gender==2]
data <- data.frame(
type = c( rep("Male", 56), rep("Female", 110) ),
value = c(SurfMale, SurfFemale)
)
ggplot(data, aes(x=value, fill=type)) +
geom_histogram(binwidth=.5, alpha=.5, position="identity")
Next we run regressions where exam points is our dependent variable. Let us first look at the correlations between exam points and the explanatory variables.
plot(Attitude, Points, main = "Regression",
xlab = "Attitude", ylab = "Points") + abline(lm(Points~Attitude), col = "blue")
## integer(0)
plot(Gender, Points, main = "Regression",
xlab = "Gender", ylab = "Points") + abline(lm(Points~Gender), col = "blue")
## integer(0)
plot(Age, Points, main = "Regression",
xlab = "Age", ylab = "Points") + abline(lm(Points~Age), col = "blue")
## integer(0)
plot(Deep, Points, main = "Regression",
xlab = "Deep Learning", ylab = "Points") + abline(lm(Points~Deep), col = "blue")
## integer(0)
plot(Stra, Points, main = "Regression",
xlab = "Strategic Learning", ylab = "Points") + abline(lm(Points~Stra), col = "blue")
## integer(0)
plot(Surf, Points, main = "Regression",
xlab = "Surface Learning", ylab = "Points") + abline(lm(Points~Surf), col = "blue")
## integer(0)
cov(learning2014)
## Gender Age Attitude Deep Stra Surf
## Gender 0.22489960 -0.4383352 -0.10184739 -0.01526713 0.05326762 0.02826457
## Age -0.43833516 60.3119752 0.12584520 0.10792260 0.61406079 -0.58075332
## Attitude -0.10184739 0.1258452 0.53276561 0.04458987 0.03478322 -0.06776013
## Deep -0.01526713 0.1079226 0.04458987 0.30706766 0.04127419 -0.09489017
## Stra 0.05326762 0.6140608 0.03478322 0.04127419 0.59572437 -0.06570525
## Surf 0.02826457 -0.5807533 -0.06776013 -0.09489017 -0.06570525 0.27967222
## Points -0.25972983 -4.2662651 1.87824388 -0.03315078 0.66483662 -0.45002434
## Points
## Gender -0.25972983
## Age -4.26626506
## Attitude 1.87824388
## Deep -0.03315078
## Stra 0.66483662
## Surf -0.45002434
## Points 34.74965316
From the single regressions and the covariance matrix we observe that there is negative correlation between points and gender, age, deep learning and surface learning. That is, what we already observed, men were doing better in the course than women. Moreover, it seems to be so that younger students got better exam results than the older participants. Strategic learning was the only method that has positive correlation between exam results. There is a big positive correaltion between attitude and points.
Next we choose first gender, attitude and age to regress exam points.
reg <- lm(Points~Gender+Attitude+Age)
summary(reg)
##
## Call:
## lm(formula = Points ~ Gender + Attitude + Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.4590 -3.3221 0.2186 4.0247 10.4632
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.76802 3.17695 4.019 8.93e-05 ***
## Gender 0.33054 0.91934 0.360 0.720
## Attitude 3.60657 0.59322 6.080 8.34e-09 ***
## Age -0.07586 0.05367 -1.414 0.159
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.315 on 162 degrees of freedom
## Multiple R-squared: 0.2018, Adjusted R-squared: 0.187
## F-statistic: 13.65 on 3 and 162 DF, p-value: 5.536e-08
From here we observe that attitude is the only significant explanatory variable by itself. However, by the F-test all three variables are together significant explanatories.
Nevertheless, following the instructions we drop off gender variable and run the regression again.
# Regression
reg <- lm(Points~Attitude+Age)
summary(reg)
##
## Call:
## lm(formula = Points ~ Attitude + Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.3354 -3.3095 0.2625 4.0005 10.4911
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.57244 2.24943 6.034 1.04e-08 ***
## Attitude 3.54392 0.56553 6.267 3.17e-09 ***
## Age -0.07813 0.05315 -1.470 0.144
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.301 on 163 degrees of freedom
## Multiple R-squared: 0.2011, Adjusted R-squared: 0.1913
## F-statistic: 20.52 on 2 and 163 DF, p-value: 1.125e-08
Attitude remains to be the strongest predictor. Age is still unsignifant even though together with attitude it is significant (F-test). It turns out, after trying all possible combinations of explanatory variables, only regressing with attitude we get 5 % significant level of all regressors. We thus lastly regress only with attitude and get the following.
reg <- lm(Points~Attitude)
summary(reg)
##
## Call:
## lm(formula = Points ~ Attitude)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## Attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
From the summary we observe that the intercept term is positive and significant: 11.63 with standard error of 1.83. This is the test score if attitude was “zero”. Each unit jump in attitution increases (not causal) exam results by 3.5 points. Hence, with maximum attitude our model predicts that the test result is \(11.63 + 5 * 3.5 = 29\) points.
We observe also that R-squared decreases a little everytime we omit an explanatory variable from the regression. R-squared is the measure that represents the proportion of the variance for the dependent variable that is explained by explanatory variables. The adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model.
Lastly we plot the regression diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage.
plot(reg)
Residuals vs Fitted values figure is a simple scatterplot between residuals and predicted values. It should look more or less random, which is roughly the case in our analysis. The red line is not exactly zero all the time, which is a sign of that the residuals may have a little positive predictive power (omitted variable bias).
The Normal QQ plot helps us to assess whether the residuals are roughly normally distributed. In our cases the tails scarper from the diagonal which may imply that we have some problems with assumption of normally distributed error terms. Probably some t-distribution could be better.
The last plot for the residuals vs leverage (Cook’s distance) tells us which points have the greatest influence on the regression (leverage points). It turns out that points 35, 56, and 71 have the strongest effects on the dependent variable.
For the causal inference we need to have very strong assumptions.
Before going to the assumptions, we introduce the vector and matrix notation. Define K-dimensional (column) vectors \(\boldsymbol{x}_i\) and \(\boldsymbol{\beta}\) as
\[ \underset{(K \times 1)}{\boldsymbol{x}_i} = \begin{pmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{iK} \end{pmatrix}, \underset{(K \times 1)}{\boldsymbol{\beta}} = \begin{pmatrix} \beta_{1} \\ \beta_{2} \\ \vdots \\ \beta_{K} \end{pmatrix}.\] Also define \[ \underset{(n \times 1)}{\boldsymbol{y}} = \begin{pmatrix} y_{1} \\ \vdots \\ y_{n} \end{pmatrix}, \underset{(n \times 1)}{\boldsymbol{\varepsilon}} = \begin{pmatrix} \varepsilon_{1} \\ \vdots \\ \varepsilon_{n} \end{pmatrix}, \underset{(n \times K)}{\boldsymbol{X}} = \begin{pmatrix} \boldsymbol{x}'_{1} \\ \vdots \\ \boldsymbol{x}'_{n} \end{pmatrix} = \begin{pmatrix} x_{11} & \dots & x_{1K} \\ \vdots & \ddots & \vdots \\ x_{n1} & \dots & x_{nK} \end{pmatrix}. \] Scalar quantities will be given in normal font \(x\), vector quantities in bold lowercase \(\boldsymbol{x}\), and all vectors will be presumed to be column vectors. Matrix quantities will be in bold uppercase \(\boldsymbol{X}\). The transpose of a matrix is denoted by either \(\boldsymbol{X}'\) or \(\boldsymbol{X}^T\).
Using the matrix notation, the assumptions for the causal inference are the following
Assumption 1.1 (linearity): \[\underset{(n \times 1)}{\boldsymbol{y}} = \underset{\underbrace{(n \times K)(K \times 1)}_{(n \times 1)}}{\boldsymbol{X} \: \: \: \boldsymbol{\beta}} + \underset{(n \times 1)}{\boldsymbol{\varepsilon}}.\]
Assumption 1.2 (strict exogeneity): \[\mathbb{E}(\varepsilon_{i}|\boldsymbol{X}) = 0 \: \: \: (i = 1, 2, \dots, n).\]
Assumption 1.3 (no multicollinearity): The rank of the \(n \times K\) data matrix, \(\boldsymbol{X}\), is \(K\) with probability 1.
Assumption 1.4 (spherical error variance): \[(\text{homoskedasticity}) \: \: \mathbb{E}(\varepsilon_{i}^2|\boldsymbol{X}) = \sigma^2 > 0 \: \: \: (i = 1, 2, \dots, n),\] \[(\text{no correlation between observations}) \: \: \: \mathbb{E}(\varepsilon_{i}\varepsilon_{j}|\boldsymbol{X}) = 0 \: \: \: (i, j = 1, 2, \dots, n; i \neq j).\]
Even though our regression diagnostic plots are promising, I would not make any kind of causal interpretations from our model.
Today is
## [1] "Thu Dec 3 12:01:18 2020"
As you can see from the date stamp, I am doing this exercise that I missed due to personal issues. However, I wanted to learn also this part of the course and thus decided to do the exercises whether that is necessary or not.
We start by reading the data we created in the data wrangling part
joinedData <- read.csv("/Users/teemupekkarinen/IODS-project/data/joinedData.csv",row.names = 1)
and use the glimpse to our data.
library(dplyr)
glimpse(joinedData)
## Rows: 382
## Columns: 35
## $ school <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP"…
## $ sex <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F"…
## $ age <int> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15…
## $ address <chr> "R", "R", "R", "R", "R", "R", "R", "R", "R", "U", "U", "U"…
## $ famsize <chr> "GT3", "GT3", "GT3", "GT3", "GT3", "GT3", "GT3", "LE3", "L…
## $ Pstatus <chr> "T", "T", "T", "T", "T", "T", "T", "T", "T", "A", "A", "T"…
## $ Medu <int> 1, 1, 2, 2, 3, 3, 3, 2, 3, 3, 4, 1, 1, 1, 1, 1, 2, 2, 2, 3…
## $ Fedu <int> 1, 1, 2, 4, 3, 4, 4, 2, 1, 3, 3, 1, 1, 1, 2, 2, 1, 2, 3, 2…
## $ Mjob <chr> "at_home", "other", "at_home", "services", "services", "se…
## $ Fjob <chr> "other", "other", "other", "health", "services", "health",…
## $ reason <chr> "home", "reputation", "reputation", "course", "reputation"…
## $ nursery <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "no…
## $ internet <chr> "yes", "yes", "no", "yes", "yes", "yes", "yes", "yes", "ye…
## $ guardian <chr> "mother", "mother", "mother", "mother", "other", "mother",…
## $ traveltime <int> 2, 1, 1, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 1, 1, 1, 3, 1, 2, 1…
## $ studytime <int> 4, 2, 1, 3, 3, 3, 3, 2, 4, 4, 2, 1, 2, 2, 2, 2, 3, 4, 1, 2…
## $ failures <int> 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2…
## $ schoolsup <chr> "yes", "yes", "yes", "yes", "no", "yes", "no", "yes", "no"…
## $ famsup <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "y…
## $ paid <chr> "yes", "no", "no", "no", "yes", "no", "no", "no", "no", "n…
## $ activities <chr> "yes", "no", "yes", "yes", "yes", "yes", "no", "no", "no",…
## $ higher <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "y…
## $ romantic <chr> "no", "yes", "no", "no", "yes", "no", "yes", "no", "no", "…
## $ famrel <int> 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 5, 5, 3, 3…
## $ freetime <int> 1, 3, 3, 3, 2, 3, 2, 1, 4, 3, 3, 3, 3, 4, 3, 2, 2, 1, 5, 3…
## $ goout <int> 2, 4, 1, 2, 1, 2, 2, 3, 2, 3, 2, 3, 2, 2, 2, 3, 2, 2, 1, 2…
## $ Dalc <int> 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1…
## $ Walc <int> 1, 4, 1, 1, 3, 1, 2, 3, 3, 1, 1, 2, 3, 2, 1, 2, 1, 1, 1, 1…
## $ health <int> 1, 5, 2, 5, 3, 5, 5, 4, 3, 4, 1, 4, 4, 5, 5, 1, 4, 3, 5, 3…
## $ absences <int> 3, 2, 8, 2, 5, 2, 0, 1, 9, 10, 0, 3, 2, 0, 4, 1, 2, 6, 2, …
## $ G1 <int> 10, 10, 14, 10, 12, 12, 11, 10, 16, 10, 14, 10, 11, 10, 12…
## $ G2 <int> 12, 8, 13, 10, 12, 12, 6, 10, 16, 10, 14, 6, 11, 12, 12, 1…
## $ G3 <int> 12, 8, 12, 9, 12, 12, 6, 10, 16, 10, 15, 6, 11, 12, 12, 14…
## $ alc_use <dbl> 1.0, 3.0, 1.0, 1.0, 2.5, 1.0, 2.0, 2.0, 2.5, 1.0, 1.0, 1.5…
## $ high_use <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE…
We have 382 observations and 35 variables. There are individual level background variables like gender, age, school performance, and alcohol consumption. Moreover, we have variables that are related to individuals parental background like family size, parents’ cohabitation status, and parental education status.
The summary of the data is the following.
summary(joinedData)
## school sex age address
## Length:382 Length:382 Min. :15.00 Length:382
## Class :character Class :character 1st Qu.:16.00 Class :character
## Mode :character Mode :character Median :17.00 Mode :character
## Mean :16.59
## 3rd Qu.:17.00
## Max. :22.00
## famsize Pstatus Medu Fedu
## Length:382 Length:382 Min. :0.000 Min. :0.000
## Class :character Class :character 1st Qu.:2.000 1st Qu.:2.000
## Mode :character Mode :character Median :3.000 Median :3.000
## Mean :2.806 Mean :2.565
## 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :4.000 Max. :4.000
## Mjob Fjob reason nursery
## Length:382 Length:382 Length:382 Length:382
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## internet guardian traveltime studytime
## Length:382 Length:382 Min. :1.000 Min. :1.000
## Class :character Class :character 1st Qu.:1.000 1st Qu.:1.000
## Mode :character Mode :character Median :1.000 Median :2.000
## Mean :1.448 Mean :2.037
## 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :4.000 Max. :4.000
## failures schoolsup famsup paid
## Min. :0.0000 Length:382 Length:382 Length:382
## 1st Qu.:0.0000 Class :character Class :character Class :character
## Median :0.0000 Mode :character Mode :character Mode :character
## Mean :0.2016
## 3rd Qu.:0.0000
## Max. :3.0000
## activities higher romantic famrel
## Length:382 Length:382 Length:382 Min. :1.000
## Class :character Class :character Class :character 1st Qu.:4.000
## Mode :character Mode :character Mode :character Median :4.000
## Mean :3.937
## 3rd Qu.:5.000
## Max. :5.000
## freetime goout Dalc Walc health
## Min. :1.00 Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:3.00 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:3.000
## Median :3.00 Median :3.000 Median :1.000 Median :2.000 Median :4.000
## Mean :3.22 Mean :3.113 Mean :1.482 Mean :2.296 Mean :3.573
## 3rd Qu.:4.00 3rd Qu.:4.000 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:5.000
## Max. :5.00 Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## absences G1 G2 G3 alc_use
## Min. : 0.0 Min. : 2.00 Min. : 4.00 Min. : 0.00 Min. :1.000
## 1st Qu.: 1.0 1st Qu.:10.00 1st Qu.:10.00 1st Qu.:10.00 1st Qu.:1.000
## Median : 3.0 Median :12.00 Median :12.00 Median :12.00 Median :1.500
## Mean : 4.5 Mean :11.49 Mean :11.47 Mean :11.46 Mean :1.889
## 3rd Qu.: 6.0 3rd Qu.:14.00 3rd Qu.:14.00 3rd Qu.:14.00 3rd Qu.:2.500
## Max. :45.0 Max. :18.00 Max. :18.00 Max. :18.00 Max. :5.000
## high_use
## Mode :logical
## FALSE:268
## TRUE :114
##
##
##
We are going to the following four variables in our analysis:
By this choices we want to see if there is evidence that males drink more and control this by age. Father’s education we use as a proxy variables for socioeconomic status. This gives us insight if people in lower socioeconomic status consume more alcohol. Lastly, we control alcohol consumption by school attendance.
To that end, we manipulate our data as follows.
library(tidyr)
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(dplyr)
analysisAlc <- joinedData %>%
select(sex, age, Fedu, absences, high_use)
analysisAlc$Fedu <- cut(analysisAlc$Fedu, breaks = c(-Inf, 0, 1, 2, 3, Inf), labels= c("None", "primaryLow", "primaryHigh", "secondary", "Tertiary") )
analysisAlc$Fedu <- factor(analysisAlc$Fedu)
summary(analysisAlc)
## sex age Fedu absences
## Length:382 Min. :15.00 None : 2 Min. : 0.0
## Class :character 1st Qu.:16.00 primaryLow : 77 1st Qu.: 1.0
## Mode :character Median :17.00 primaryHigh:105 Median : 3.0
## Mean :16.59 secondary : 99 Mean : 4.5
## 3rd Qu.:17.00 Tertiary : 99 3rd Qu.: 6.0
## Max. :22.00 Max. :45.0
## high_use
## Mode :logical
## FALSE:268
## TRUE :114
##
##
##
Let us next do some descriptive analysis for our data. We begin by plotting historgrams.
gather(analysisAlc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped
From here we can see that approximately 1/3 of the sample are high users of alcohol. The individuals almost equally shared to female and male youngsters, mostly 15-18 years old, and their farthers are well educated. The distribution of absences in school has dispersion; there are students who are almost never absent and students whose are absent very often.
Next we draw some boxplots.
box <- ggplot(analysisAlc, aes(x = high_use , y = absences, col = Fedu ))
box + geom_boxplot() + ylab("Absences")
box2 <- ggplot(analysisAlc, aes(x = high_use , y = age, col = Fedu ))
box2 + geom_boxplot() + ylab("age")
The median number of absences is slightly greater among high alcohol users. Moreover, it seems to be so that high alcohol users are older (the median is higher). This, naturally, sounds reasonable statistics.
Next we do a table that shows the fraction of high alcohol users by the father’s educational background. It seems to be so there is no connection between high alcohol use and father’s education.
library(knitr)
library(dplyr)
table <- analysisAlc %>%
group_by(Fedu) %>%
summarise(mean_high_use=mean(high_use))
## `summarise()` ungrouping output (override with `.groups` argument)
kable(table)
| Fedu | mean_high_use |
|---|---|
| None | 0.0000000 |
| primaryLow | 0.3116883 |
| primaryHigh | 0.2857143 |
| secondary | 0.2727273 |
| Tertiary | 0.3333333 |
Next we analyse alcohol use by age and gender.
library(tidyr)
table <- analysisAlc %>%
group_by(sex) %>%
summarise(mean_high_use=mean(high_use))
## `summarise()` ungrouping output (override with `.groups` argument)
kable(table)
| sex | mean_high_use |
|---|---|
| F | 0.2121212 |
| M | 0.3913043 |
This indicates that boys drink more alcohol than girls.
In this section we use logistic regression to highlight our descriptive findings.
We start by running the model and summarizing the results
model <- glm(high_use ~ sex + absences + age + Fedu, data = analysisAlc, family = "binomial")
summary(model)
##
## Call:
## glm(formula = high_use ~ sex + absences + age + Fedu, family = "binomial",
## data = analysisAlc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2790 -0.8479 -0.6231 1.0561 2.1848
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -18.41011 623.86680 -0.030 0.976
## sexM 1.00410 0.24205 4.148 3.35e-05 ***
## absences 0.09335 0.02326 4.014 5.98e-05 ***
## age 0.16813 0.10351 1.624 0.104
## FeduprimaryLow 13.90961 623.86478 0.022 0.982
## FeduprimaryHigh 13.70235 623.86475 0.022 0.982
## Fedusecondary 13.59776 623.86476 0.022 0.983
## FeduTertiary 13.96070 623.86475 0.022 0.982
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 423.95 on 374 degrees of freedom
## AIC: 439.95
##
## Number of Fisher Scoring iterations: 13
coef(model)
## (Intercept) sexM absences age FeduprimaryLow
## -18.41010883 1.00410227 0.09335241 0.16813210 13.90961364
## FeduprimaryHigh Fedusecondary FeduTertiary
## 13.70235173 13.59776078 13.96070194
All the coefficients of the regression are positive. Hence, if an individual is old, male, and has a high number of absences, the probability of high alcohol consumption is greater. However, the effects of age and father’s education are not significant.
Let us next calculate odds ratios for the variables.
OR <- coef(model) %>% exp
CI <- confint(model) %>% exp
## Waiting for profiling to be done...
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 1.010628e-08 NA 1.713949e+33
## sexM 2.729456e+00 1.708562e+00 4.421144e+00
## absences 1.097849e+00 1.051069e+00 1.151547e+00
## age 1.183093e+00 9.669145e-01 1.452162e+00
## FeduprimaryLow 1.098673e+06 3.612856e-37 NA
## FeduprimaryHigh 8.930088e+05 2.829941e-37 NA
## Fedusecondary 8.043267e+05 2.570223e-37 NA
## FeduTertiary 1.156261e+06 3.630630e-37 NA
Here we again see that age and father’s education are not statistically significant.Therefore let us drop these two and run the model again.
model2 <- glm(high_use ~ sex + absences, data = analysisAlc, family = "binomial")
OR <- coef(model2) %>% exp
CI <- confint(model2) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.159445 0.1012577 0.2427684
## sexM 2.658116 1.6710354 4.2863129
## absences 1.101409 1.0549317 1.1548057
Now the odds ratio for the male indicator is statistically significant. The coefficient value 2.66 means that if a person is a male, the odds of high alcohol consumption is 2.6 times greater when absences are held constant. The odds ratio of abcenses is 1.1 indicating that if absences increase by one, the odds of high alcohol usage increases by 1.1.
Next we make some predictions.
library(visreg)
library(caret)
## Loading required package: lattice
library(sjlabelled)
##
## Attaching package: 'sjlabelled'
## The following object is masked from 'package:dplyr':
##
## as_label
model2 <- glm(high_use ~ sex + absences, data = analysisAlc, family = "binomial")
probs <- predict(model2, type = "response")
analysisAlc<- mutate(analysisAlc, probability = probs)
analysisAlc <- mutate(analysisAlc, prediction = probability > 0.5)
table(high_use = analysisAlc$high_use, prediction = analysisAlc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 258 10
## TRUE 88 26
table(high_use = analysisAlc$high_use, prediction = analysisAlc$prediction) %>%
prop.table() %>% addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.67539267 0.02617801 0.70157068
## TRUE 0.23036649 0.06806283 0.29842932
## Sum 0.90575916 0.09424084 1.00000000
g <- ggplot(analysisAlc, aes(x = probability, y = high_use , col = prediction))
g + geom_point()
We observe many false-negative predictions. That is, it is likely that our model indicates that a person does not consume lot of alcohol in the case if she or he really does so.
Based on the following loss functions we observe that approximately 1/4 our model’s predictions are wrong. The latter loss function value says our model is still better than just randomization.
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = analysisAlc$high_use, prob = analysisAlc$probability)
## [1] 0.2565445
loss_func(class = analysisAlc$high_use, prob = 0.333)
## [1] 0.2984293
Today is
## [1] "Thu Dec 3 12:01:22 2020"
This week we use the Boston data from the MASS package. The data have 14 variables and 506 observartions for each variable. The variables are either numerical or integers.
# Package
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data(Boston)
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Namely, the variables are
The correlation matrix can be illustrated by using corrplot package as follows
library(corrplot)
## corrplot 0.84 loaded
corr_matrix<-cor(Boston)
corrplot(corr_matrix)
That is, there is a a significant positive correlation between crim, rad, tax and lsat. On the other hand, the weighted mean of distances to five Boston employment centres, dis, has negative correlation with indus, nox, and age.
However, since the crime rate seems to be the variable in interest, let us illustrate it by some graphics.
require(ggplot2)
require(plotly)
## Loading required package: plotly
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(data = Boston, x = ~crim, type = "histogram")
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
plot_ly(data = Boston, x = ~rad, y = ~crim)
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
plot_ly(data = Boston, x = ~tax, y = ~crim)
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
plot_ly(data = Boston, x = ~lstat, y = ~crim)
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
Next we standardize the dataset and print out summaries of the scaled data.
scaled.Boston <- data.frame(scale(Boston))
Now all variables have mean of \(0\) and variance \(1\). For instance, now the plot of crim and lstat is the following.
plot_ly(data = scaled.Boston, x = ~lstat, y = ~crim)
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
Next we divide crim in 5 categories based on the 20/%-quantiles.
library(gtools)
q.crim <- quantcut(scaled.Boston$crim,q = 5)
summary(q.crim)
## [-0.419,-0.413] (-0.413,-0.403] (-0.403,-0.356] (-0.356,0.229] (0.229,9.92]
## 102 101 101 101 101
scaled.Boston$crim <- q.crim
We divide the dataset into train and test sets, so that 80/% of the data belongs to the train set.
sample <- sample.int(n = nrow(scaled.Boston), size = floor(.8*nrow(scaled.Boston)), replace = F)
train <- scaled.Boston[sample, ]
test <- scaled.Boston[-sample, ]
The linear discriminant analysis:
# MASS and train are available
# linear discriminant analysis
lda.fit <- lda(crim~., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crim ~ ., data = train)
##
## Prior probabilities of groups:
## [-0.419,-0.413] (-0.413,-0.403] (-0.403,-0.356] (-0.356,0.229] (0.229,9.92]
## 0.1955446 0.2004950 0.1980198 0.2103960 0.1955446
##
## Group means:
## zn indus chas nox rm
## [-0.419,-0.413] 1.2289308 -0.9919693 -0.12281893 -0.9117984 0.52532136
## (-0.413,-0.403] 0.1178038 -0.5224852 -0.02929819 -0.6534455 0.09327762
## (-0.403,-0.356] -0.2816979 -0.2714370 0.12138096 -0.2874372 0.01752820
## (-0.356,0.229] -0.4267078 0.5746297 0.23717802 0.8207357 -0.19851700
## (0.229,9.92] -0.4872402 1.0149946 -0.07298222 1.0253163 -0.29879457
## age dis rad tax ptratio
## [-0.419,-0.413] -0.8925501 0.9531282 -0.71438014 -0.7735768 -0.48814134
## (-0.413,-0.403] -0.4732191 0.3753418 -0.55793090 -0.5069924 -0.26401726
## (-0.403,-0.356] -0.0190914 0.1186790 -0.46793221 -0.4469321 0.02111665
## (-0.356,0.229] 0.6472870 -0.5257096 0.09228323 0.2269943 -0.14629009
## (0.229,9.92] 0.8525123 -0.8933315 1.65960290 1.5294129 0.80577843
## black lstat medv
## [-0.419,-0.413] 0.3872598 -0.80899932 0.58453774
## (-0.413,-0.403] 0.3562116 -0.41520978 0.25859279
## (-0.403,-0.356] 0.2987226 -0.01308464 0.09306665
## (-0.356,0.229] -0.1935722 0.14199508 -0.07891044
## (0.229,9.92] -0.9427289 0.97919705 -0.79055113
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3 LD4
## zn -0.012130754 0.83213259 1.10936204 -0.25605164
## indus 0.154075566 -0.31077073 0.24012061 0.14949310
## chas -0.074203341 -0.06224915 -0.07601345 -0.07200118
## nox 0.402418638 -0.90518526 1.26391889 -0.75211814
## rm 0.075865601 0.28911936 -0.23020678 -0.58379768
## age 0.243579699 -0.30414585 0.36050063 -0.20823905
## dis -0.066911227 -0.69397644 0.21749263 -0.54758555
## rad 1.624367299 1.05092627 0.09523746 -0.12096880
## tax -0.014558433 -0.19393270 -0.61632646 1.00328645
## ptratio -0.004230517 0.07997125 0.16193298 -0.83765381
## black -0.166249120 -0.01752590 -0.15002556 -0.08766222
## lstat 0.320361204 0.13178705 -0.74935128 -1.11112675
## medv 0.039301801 -0.47588448 0.03455914 -0.43736581
##
## Proportion of trace:
## LD1 LD2 LD3 LD4
## 0.8540 0.0933 0.0456 0.0071
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crim)
# plot the lda results
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale = 1)
Next we cross tabulate the results with the crime categories from the test set.
# lda.fit, correct_classes and test are available
correct_classes <- test$crim
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct [-0.419,-0.413] (-0.413,-0.403] (-0.403,-0.356]
## [-0.419,-0.413] 12 8 3
## (-0.413,-0.403] 2 13 4
## (-0.403,-0.356] 0 6 10
## (-0.356,0.229] 0 0 4
## (0.229,9.92] 0 0 0
## predicted
## correct (-0.356,0.229] (0.229,9.92]
## [-0.419,-0.413] 0 0
## (-0.413,-0.403] 1 0
## (-0.403,-0.356] 5 0
## (-0.356,0.229] 6 6
## (0.229,9.92] 0 22
Let us next reload the Boston dataset and standardize it. Then we calculate the distances between the observations.
# load MASS and Boston
library(MASS)
data('Boston')
# standardization
Boston <- scale(Boston)
# euclidean distance matrix
dist_eu <- dist(Boston)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(Boston, method = "manhattan")
# look at the summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
One way to determine the number of clusters is to look at how the total of within cluster sum of squares (WCSS) behaves when the number of cluster changes. The optimal number of clusters is when the value of total WCSS changes radically. In this case, two clusters would seem optimal.
# MASS, ggplot2 and Boston dataset are available
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
# k-means clustering
km <-kmeans(Boston, centers = 2)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
Today is
## [1] "Thu Dec 3 12:01:30 2020"
This week we use the ‘human’ dataset originates from the United Nations Development Programme. Graphical overview of the data and show summaries of the variables in the data is the following. We have 155 observations and 8 variables.
# Read the data
human <- read.csv("/Users/teemupekkarinen/IODS-project/data/human",row.names = 1)
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 0.198 0.931 0.861 0.977 0.989 ...
## $ Labo.FM : num 0.199 0.685 0.211 0.633 0.747 ...
## $ Edu.Exp : num 9.3 11.8 14 17.9 12.3 20.2 15.7 11.9 12.6 14.4 ...
## $ Life.Exp : num 60.4 77.8 74.8 76.3 74.7 82.4 81.4 70.8 75.4 76.6 ...
## $ GNI : int 1885 9943 13054 22050 8124 42261 43869 16428 21336 38599 ...
## $ Mat.Mor : int 400 21 89 69 29 6 4 26 37 22 ...
## $ Ado.Birth: num 86.8 15.3 10 54.4 27.1 12.1 4.1 40 28.5 13.8 ...
## $ Parli.F : num 27.6 20.7 25.7 36.8 10.7 30.5 30.3 15.6 16.7 15 ...
dim(human)
## [1] 155 8
summary(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
Namely, the variables are
The data sorted by the GNI
set.seed(1)
s <- sample(1:nrow(human),20)
subdata <- human[s,]
sort.GNI <- subdata$GNI[order(subdata$GNI,decreasing = T)]
barplot(sort.GNI, names.arg = row.names(subdata[order(subdata$GNI,decreasing = T),]), las=2)
The scatter plot of Life.Exp and GNI
library(ggplot2)
library(ggrepel)
list <- c("Australia",
"Azerbaijan","Belgium","Canada", "Chile", "China", "Cuba", "Denmark",
"Finland", "France", "Spain", "United States", "United Kingdom", "Germany", "India", "Greece", "Ghana", "Israel", "Hungary", "Niger", "Argentina", "Mexico", "Venezuela", "Mongolia", "Morocco", "Nepal", "Namibia", "Pakistan", "Peru", "Philippines", "Romania", "Tajikistan", "Tunisia", "Senegal", "Zambia")
s <- which(rownames(human) %in% list)
p <- ggplot(human[s,], aes(Life.Exp, GNI)) +
geom_point(color = 'red') +
theme_classic(base_size = 10) + labs(y = 'Gross National Income per capita', x = 'Life expectancy at birth')
p + geom_text_repel(aes(label = rownames(human[s,])),
size = 3.5)
The scatter plot of Edu.Exp and GNI
p <- ggplot(human[s,], aes(Edu.Exp, GNI)) +
geom_point(color = 'red') +
theme_classic(base_size = 10) + labs(y = 'Gross National Income per capita', x = 'Expected years in schooling')
p + geom_text_repel(aes(label = rownames(human[s,])),
size = 3.5)
The scatter plot of Edu2.FM and GNI
p <- ggplot(human[s,], aes(Edu2.FM, GNI)) +
geom_point(color = 'red') +
theme_classic(base_size = 10) + labs(y = 'Gross National Income per capita', x = 'The ratio of Female and Male populations with secondary education')
p + geom_text_repel(aes(label = rownames(human[s,])),
size = 3.5)
The scatter plot of Labo.FM and GNI
p <- ggplot(human[s,], aes(Labo.FM, GNI)) +
geom_point(color = 'red') +
theme_classic(base_size = 10) + labs(y = 'Gross National Income per capita', x = 'The ratio of labour force participation of females and males')
p + geom_text_repel(aes(label = rownames(human[s,])),
size = 3.5)
The scatter plot of Parli.F and GNI
p <- ggplot(human[s,], aes(Parli.F, GNI)) +
geom_point(color = 'red') +
theme_classic(base_size = 10) + labs(y = 'Gross National Income per capita', x = 'Women\'s participation in parliament (percent)')
p + geom_text_repel(aes(label = rownames(human[s,])),
size = 3.5)
To summarize the correlations betweem variables let us plot the correlation matrix.
library(corrplot)
corr_matrix<-cor(human)
corrplot(corr_matrix)
Principal Component Analysis (PCA) can be performed by two sightly different matrix decomposition methods from linear algebra: the Eigenvalue Decomposition and the Singular Value Decomposition (SVD).
Both methods quite literally decompose a data matrix into a product of smaller matrices, which let’s us extract the underlying principal components. This makes it possible to approximate a lower dimensional representation of the data by choosing only a few principal components.
Let us next perform the PCA for our human data.
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)
pca_human
## Standard deviations (1, .., p=8):
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## Edu2.FM 5.607472e-06 -0.0006713951 -3.412027e-05 -2.736326e-04 0.0022935252
## Labo.FM -2.331945e-07 0.0002819357 5.302884e-04 -4.692578e-03 -0.0022190154
## Edu.Exp 9.562910e-05 -0.0075529759 1.427664e-02 -3.313505e-02 -0.1431180282
## Life.Exp 2.815823e-04 -0.0283150248 1.294971e-02 -6.752684e-02 -0.9865644425
## GNI 9.999832e-01 0.0057723054 -5.156742e-04 4.932889e-05 0.0001135863
## Mat.Mor -5.655734e-03 0.9916320120 1.260302e-01 -6.100534e-03 -0.0266373214
## Ado.Birth -1.233961e-03 0.1255502723 -9.918113e-01 5.301595e-03 -0.0188618600
## Parli.F 5.526460e-05 -0.0032317269 -7.398331e-03 -9.971232e-01 0.0716401914
## PC6 PC7 PC8
## Edu2.FM 2.180183e-02 -6.998623e-01 7.139410e-01
## Labo.FM 3.264423e-02 -7.132267e-01 -7.001533e-01
## Edu.Exp 9.882477e-01 3.826887e-02 7.776451e-03
## Life.Exp -1.453515e-01 -5.380452e-03 2.281723e-03
## GNI -2.711698e-05 8.075191e-07 -1.176762e-06
## Mat.Mor 1.695203e-03 -1.355518e-04 8.371934e-04
## Ado.Birth 1.273198e-02 8.641234e-05 -1.707885e-04
## Parli.F -2.309896e-02 2.642548e-03 2.680113e-03
summary(pca_human)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
library(ggfortify)
pca.plot <- autoplot(pca_human, data = human, colour = 'GNI')
## Warning: `select_()` is deprecated as of dplyr 0.7.0.
## Please use `select()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
pca.plot
And the same for standarized data:
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(scale(human))
pca_human
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## Edu2.FM 0.35664370 -0.03796058 -0.24223089 0.62678110 -0.5983585
## Labo.FM -0.05457785 -0.72432726 -0.58428770 0.06199424 0.2625067
## Edu.Exp 0.42766720 -0.13940571 -0.07340270 -0.07020294 0.1659678
## Life.Exp 0.44372240 0.02530473 0.10991305 -0.05834819 0.1628935
## GNI 0.35048295 -0.05060876 -0.20168779 -0.72727675 -0.4950306
## Mat.Mor -0.43697098 -0.14508727 -0.12522539 -0.25170614 -0.1800657
## Ado.Birth -0.41126010 -0.07708468 0.01968243 0.04986763 -0.4672068
## Parli.F 0.08438558 -0.65136866 0.72506309 0.01396293 -0.1523699
## PC6 PC7 PC8
## Edu2.FM 0.17713316 0.05773644 0.16459453
## Labo.FM -0.03500707 -0.22729927 -0.07304568
## Edu.Exp -0.38606919 0.77962966 -0.05415984
## Life.Exp -0.42242796 -0.43406432 0.62737008
## GNI 0.11120305 -0.13711838 -0.16961173
## Mat.Mor 0.17370039 0.35380306 0.72193946
## Ado.Birth -0.76056557 -0.06897064 -0.14335186
## Parli.F 0.13749772 0.00568387 -0.02306476
summary(pca_human)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2)
pca.plot <- autoplot(pca_human, data = scale(human), colour = 'GNI')
pca.plot
In PCA scale is necessary as it removes the biases in the original variables. Especially in our case in which we have variables in different units. The standardized variables will be unitless and have a similar variance, which allows to divide the data in components that try to explain the variance of the data as much as possible(!). Hence, it is reasonable to interpret only the results for the standardised data.
The first principal component is able to capture 54 percent of the total variability of the data. The second component 16 percent and the third 10 percent. The lower components explain less than 10 percent in total of 30 percent. That is, three first principal components can explain 80 percent of the total variability of the whole human data.
From the biplot and loadings plot, we can see the variables Labo.FM and Parli.F are highly associated and form a cluster. On the other hand, variables Ado.Birth and Mat.Mor form a cluster and are associated as well. And the third and the greatest cluster is formed by Edu.Exp, Life.Exp,Edu2.FM, and GNI. In clusters there are a strong positive correlation between the variables. The length of PCs in biplot refers to the amount of variance contributed by the PCs. The longer the length of PC, the higher the variance contributed and well represented in space. If the variables are highly associated, the angle between the variable vectors should be as small as possible in the biplot. It seems to be so that the clusters are divided in three: women rights, child mortality, and education. The first principal component explains the variation in child mortality and education, and the second principal component the women rights.
Lastly we load the tea dataset from the package FactoMineR and explore the data briefly. The data used here concern a questionnaire on tea. They asked to 300 individuals how they drink tea (18 questions), what are their product’s perception (12 questions) and some personal details (4 questions). A data frame with 300 rows and 36 columns. Rows represent the individuals, columns represent the different questions. The first 18 questions are active ones, the 19th is a supplementary quantitative variable (the age) and the last variables are supplementary categorical variables.
library(FactoMineR)
library(dplyr)
library(tidyr)
# reduced dataset
data(tea)
tea_time <- select(tea, c("Tea", "How", "how", "sugar", "where", "lunch"))
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
dim(tea_time)
## [1] 300 6
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.279 0.261 0.219 0.189 0.177 0.156 0.144
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519 7.841
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953 77.794
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.141 0.117 0.087 0.062
## % of var. 7.705 6.392 4.724 3.385
## Cumulative % of var. 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139 0.003
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626 0.027
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111 0.107
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841 0.127
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979 0.035
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990 0.020
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347 0.102
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459 0.161
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968 0.478
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898 0.141
## v.test Dim.3 ctr cos2 v.test
## black 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 2.867 | 0.433 9.160 0.338 10.053 |
## green -5.669 | -0.108 0.098 0.001 -0.659 |
## alone -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 3.226 | 1.329 14.771 0.218 8.081 |
## milk 2.422 | 0.013 0.003 0.000 0.116 |
## other 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")
To summarize the data we find the following. People mostly drink tea in tea bags and without lemon and milk. The most common time to drink tea is not lunch time. People are divided between drinking tea with and without sugar. Earl Grey is the most popular brand and it people buy their tea from chain stores.
Based on the MCA we can conclude that unpacked tea is bought from tea shops and tea bags from chain stores. There is a cluster of drinking tea in outside of lunch time without any additional ingredients (milk or lemon). Moreover, it seems to be so that there is a little correlation between drinking Earl Grey with milk. Green tea is
Today is
## [1] "Thu Dec 3 12:01:37 2020"
This week we practise the analysis of longitudinal data. We will make graphical displays and summaries. We use the data RATS given by here: The data is from a nutrition study conducted in three groups of rats (Crowder and Hand, 1990). The three groups were put on different diets, and each animal’s body weight (grams) was recorded repeatedly (approximately weekly, except in week seven when two recordings were taken) over a 9-week period.
# Access the packages dplyr and tidyr
library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)
# Read the data
RATSL <- read.csv("/Users/teemupekkarinen/IODS-project/data/RATSL",row.names = 1,stringsAsFactors = T)
str(RATSL)
## 'data.frame': 176 obs. of 5 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Group: int 1 1 1 1 1 1 1 1 2 2 ...
## $ Times: Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Rats : int 240 225 245 260 255 260 275 245 410 405 ...
## $ Time : int 1 1 1 1 1 1 1 1 1 1 ...
dim(RATSL)
## [1] 176 5
summary(RATSL)
## ID Group Times Rats Time
## Min. : 1.00 Min. :1.00 WD1 :16 Min. :225.0 Min. : 1.00
## 1st Qu.: 4.75 1st Qu.:1.00 WD15 :16 1st Qu.:267.0 1st Qu.:15.00
## Median : 8.50 Median :1.50 WD22 :16 Median :344.5 Median :36.00
## Mean : 8.50 Mean :1.75 WD29 :16 Mean :384.5 Mean :33.55
## 3rd Qu.:12.25 3rd Qu.:2.25 WD36 :16 3rd Qu.:511.2 3rd Qu.:50.00
## Max. :16.00 Max. :3.00 WD43 :16 Max. :628.0 Max. :64.00
## (Other):80
# Factor ID & Group
RATSL$ID <- factor(RATSL$ID)
RATSL$Group <- factor(RATSL$Group)
# Take a glimpse at the RATSL data
glimpse(RATSL)
## Rows: 176
## Columns: 5
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,…
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, …
## $ Times <fct> WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1…
## $ Rats <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470…
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, …
# Table
kable(RATSL[1:16,])
| ID | Group | Times | Rats | Time |
|---|---|---|---|---|
| 1 | 1 | WD1 | 240 | 1 |
| 2 | 1 | WD1 | 225 | 1 |
| 3 | 1 | WD1 | 245 | 1 |
| 4 | 1 | WD1 | 260 | 1 |
| 5 | 1 | WD1 | 255 | 1 |
| 6 | 1 | WD1 | 260 | 1 |
| 7 | 1 | WD1 | 275 | 1 |
| 8 | 1 | WD1 | 245 | 1 |
| 9 | 2 | WD1 | 410 | 1 |
| 10 | 2 | WD1 | 405 | 1 |
| 11 | 2 | WD1 | 445 | 1 |
| 12 | 2 | WD1 | 555 | 1 |
| 13 | 3 | WD1 | 470 | 1 |
| 14 | 3 | WD1 | 535 | 1 |
| 15 | 3 | WD1 | 520 | 1 |
| 16 | 3 | WD1 | 510 | 1 |
# Draw the plot
ggplot(RATSL, aes(x = Time, y = Rats, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(RATSL$Rats), max(RATSL$Rats)))
That is, there are different 16 rats for which the table above gives the initial weight in day 1. The first figure shows how the weight of rats in each group evolve over the test period. Clearly there are differences in weights between groups. Moreover, in group 2 there is a clear outlier of a huge rat. It seems to be so that, in group there are the lightest rats, in group 2 the second lightest (exept the outlier rat), and in group the heaviest. In all groups the weight of rats increased during the test period. How much, is the question we try to answer next.
Once we standardise the weights of rats by grouping on each time period, we observe that the weight of rats in group 1 are all the time below the average (at each time) and rats in groups 2 and 3 above the average.
# Standardise the variable Rats
RATSL <- RATSL %>%
group_by(Time) %>%
mutate(stdRats = (Rats - mean(Rats))/sd(Rats) ) %>%
ungroup()
# Glimpse the data
glimpse(RATSL)
## Rows: 176
## Columns: 6
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, …
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1…
## $ Times <fct> WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, W…
## $ Rats <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 4…
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8…
## $ stdRats <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.8819001, -…
# Plot again with the standardised Rats
ggplot(RATSL, aes(x = Time, y = stdRats, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
scale_y_continuous(name = "standardized Rats")
Next we look at how the mean of rats in each group evolve over time. We observe that basically the means of weight of rats increase in every group
# Number of time periods, baseline (time 1) included
n <- RATSL$Time %>% unique() %>% length()
# Summary data with mean and standard error of Rats by Group and Time
RATSS <- RATSL %>%
group_by(Group, Time) %>%
summarise( mean = mean(Rats), se = sd(Rats)/sqrt(n) ) %>%
ungroup()
## `summarise()` regrouping output by 'Group' (override with `.groups` argument)
# Glimpse the data
glimpse(RATSS)
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ Time <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 36,…
## $ mean <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375, …
## $ se <dbl> 4.589478, 3.947710, 3.460116, 4.100800, 3.333956, 3.552939, 3.3…
# Plot the mean profiles
ggplot(RATSS, aes(x = Time, y = mean, linetype=Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=4) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin = mean - se, ymax = mean + se, linetype="1"), width=0.3) +
theme(legend.position = c(0.8,0.5)) +
scale_y_continuous(name = "mean(Rats) +/- se(Rats)")
Next, we create a summary data by Group and ID with mean as the summary variable (ignoring baseline time period 1). The box plots of the full data and without the outlier rat are given below.
# Create a summary data by Group and ID with mean as the summary variable (ignoring baseline time period 1).
RATSL8S <- RATSL %>%
filter(Time > 1) %>%
group_by(Group, ID) %>%
summarise( mean=mean(Rats) ) %>%
ungroup()
## `summarise()` regrouping output by 'Group' (override with `.groups` argument)
# Glimpse the data
glimpse(RATSL8S)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9, …
# Draw a boxplot of the mean versus treatment
ggplot(RATSL8S, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Rats), time periods 2-64")
# Create a new data by filtering the outlier and adjust the ggplot code the draw the plot again with the new data
RATSL8S1 <- RATSL8S %>%
filter(mean < 550)
# Glimpse the data
glimpse(RATSL8S1)
## Rows: 15
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16
## $ mean <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9, …
# Draw a boxplot of the mean versus treatment
ggplot(RATSL8S1, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Rats), time periods 2-64")
Finally, we perform a two-sample t-test and observe the differences as seen in the boxplots above. We compute the analysis of variance table for the fitted model. We see that the baseline RATS is strongly related to the Rats values taken after treatment (Group) has begun, but there is still no evidence of a treatment (Group) difference even after conditioning on the baseline value.
# Perform a two-sample t-test
# Let us put groups 2 and 3 together in order to run the t-test
RATSL8S1$Group[which(RATSL8S1$Group==2)] <- 3
t.test(mean~Group, data = RATSL8S1, var.equal = TRUE)
##
## Two Sample t-test
##
## data: mean by Group
## t = -14.554, df = 13, p-value = 2.002e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -264.4733 -196.1053
## sample estimates:
## mean in group 1 mean in group 3
## 265.0250 495.3143
# Add the baseline from the original data as a new variable to the summary data
RATSL8S2 <- RATSL8S %>%
mutate(baseline = RATSL$Rats[which(RATSL$Times=="WD1")])
# Fit the linear model with the mean as the response
fit <- lm(mean ~ baseline + Group, data = RATSL8S2)
summary(fit)
##
## Call:
## lm(formula = mean ~ baseline + Group, data = RATSL8S2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.905 -4.194 2.190 7.577 14.800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.16375 21.87657 1.516 0.1554
## baseline 0.92513 0.08572 10.793 1.56e-07 ***
## Group2 34.85753 18.82308 1.852 0.0888 .
## Group3 23.67526 23.25324 1.018 0.3287
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.68 on 12 degrees of freedom
## Multiple R-squared: 0.9936, Adjusted R-squared: 0.992
## F-statistic: 622.1 on 3 and 12 DF, p-value: 1.989e-13
# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 253625 253625 1859.8201 1.57e-14 ***
## Group 2 879 439 3.2219 0.07586 .
## Residuals 12 1636 136
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
kable(anova(fit))
| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| baseline | 1 | 253625.4006 | 253625.4006 | 1859.820128 | 0.0000000 |
| Group | 2 | 878.7458 | 439.3729 | 3.221896 | 0.0758554 |
| Residuals | 12 | 1636.4512 | 136.3709 | NA | NA |
In this part we go a little deeper to the analysis comparing to the Part 1. As in the course book it says:
“The summary measure approach to the analysis of longitudinal data described in the previous chapter sometimes provides a useful first step in making inferences about the data, but it is really only ever a first step, and a more complete and a more appropriate analysis will involve fitting a suitable model to the data and estimating parameters that link the explanatory variables of interest to the repeated measures of the response variable”
To that end, we use Linear Mixed Effects Models for repeated measures data. To begin to see how linear mixed effects models are applied in practice, we shall use some data from Davis (2002): “Here 40 male subjects were randomly assigned to one of two treatment groups and each subject was rated on the brief psychiatric rating scale (BPRS) measured before treatment began (week 0) and then at weekly intervals for eight weeks. The BPRS assesses the level of 18 symptom constructs such as hostility, suspiciousness, hallucinations and grandiosity; each of these is rated from one (not present) to seven (extremely severe). The scale is used to evaluate patients suspected of having schizophrenia.”
We now describre the data in the similar way as for RATS data. The summaries and structure and a plot for the treatment group and control group is given. We observe the following
# Access the packages dplyr and tidyr
library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)
# Read the data
BPRSL <- read.csv("/Users/teemupekkarinen/IODS-project/data/BPRSL",row.names = 1,stringsAsFactors = T)
str(BPRSL)
## 'data.frame': 360 obs. of 5 variables:
## $ treatment: int 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : int 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : Factor w/ 9 levels "week0","week1",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ bprs : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week : int 0 0 0 0 0 0 0 0 0 0 ...
dim(BPRSL)
## [1] 360 5
summary(BPRSL)
## treatment subject weeks bprs week
## Min. :1.0 Min. : 1.00 week0 : 40 Min. :18.00 Min. :0
## 1st Qu.:1.0 1st Qu.: 5.75 week1 : 40 1st Qu.:27.00 1st Qu.:2
## Median :1.5 Median :10.50 week2 : 40 Median :35.00 Median :4
## Mean :1.5 Mean :10.50 week3 : 40 Mean :37.66 Mean :4
## 3rd Qu.:2.0 3rd Qu.:15.25 week4 : 40 3rd Qu.:43.00 3rd Qu.:6
## Max. :2.0 Max. :20.00 week5 : 40 Max. :95.00 Max. :8
## (Other):120
# Factor subject & treatment
BPRSL$subject <- factor(BPRSL$subject)
BPRSL$treatment <- factor(BPRSL$treatment)
# Take a glimpse at the RATSL data
glimpse(BPRSL)
## Rows: 360
## Columns: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ weeks <fct> week0, week0, week0, week0, week0, week0, week0, week0, wee…
## $ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66,…
## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
# Table
kable(BPRSL[1:20,])
| treatment | subject | weeks | bprs | week |
|---|---|---|---|---|
| 1 | 1 | week0 | 42 | 0 |
| 1 | 2 | week0 | 58 | 0 |
| 1 | 3 | week0 | 54 | 0 |
| 1 | 4 | week0 | 55 | 0 |
| 1 | 5 | week0 | 72 | 0 |
| 1 | 6 | week0 | 48 | 0 |
| 1 | 7 | week0 | 71 | 0 |
| 1 | 8 | week0 | 30 | 0 |
| 1 | 9 | week0 | 41 | 0 |
| 1 | 10 | week0 | 57 | 0 |
| 1 | 11 | week0 | 30 | 0 |
| 1 | 12 | week0 | 55 | 0 |
| 1 | 13 | week0 | 36 | 0 |
| 1 | 14 | week0 | 38 | 0 |
| 1 | 15 | week0 | 66 | 0 |
| 1 | 16 | week0 | 41 | 0 |
| 1 | 17 | week0 | 45 | 0 |
| 1 | 18 | week0 | 39 | 0 |
| 1 | 19 | week0 | 24 | 0 |
| 1 | 20 | week0 | 38 | 0 |
# Draw the plot
ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))
Let us next jump into the regression model.
# create a regression model BPRSL_reg
BPRSL_reg <- lm(bprs ~ week + treatment, data = BPRSL)
# print out a summary of the model
summary(BPRSL_reg)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
From here we observe that there is a little, unsignificant positive effect of treatment. Moreover, there is a significant negative effect of time.
This model assumes independence of the repeated measures of brps, which is unrealistic. Next we consider the random intercept model. Fitting a random intercept model allows the linear regression fit for each subject to differ in intercept from other subjects.
# dplyr, tidyr, RATS and RATSL are available
# access library lme4
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
# Create a random intercept model
BPRSL_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
# Print the summary of the model
summary(BPRSL_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
The week fixed effects remain negative and significant as well as the treatment effect is positive but unsignificant. Correlation of fixed effects shows negative effects on both week and treatment.
Next we create a random intercept and random slope model. “Fitting a random intercept and random slope model allows the linear regression fits for each individual to differ in intercept but also in slope. This way it is possible to account for the individual differences in the rats’ growth profiles, but also the effect of time.”
# create a random intercept and random slope model
BPRSL_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)
# print a summary of the model
summary(BPRSL_ref1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7976
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8222 8.0512
## week 0.9609 0.9802 -0.51
## Residual 97.4305 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
# perform an ANOVA test on the two models
anova(BPRSL_ref1, BPRSL_ref)
## Data: BPRSL
## Models:
## BPRSL_ref: bprs ~ week + treatment + (1 | subject)
## BPRSL_ref1: bprs ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRSL_ref 5 2748.7 2768.1 -1369.4 2738.7
## BPRSL_ref1 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From here we observe that there are no very much different to the earlier results; actually all signs and significancies remain the same.
Finally, we can fit a random intercept and slope model that allows for a treatment \(\times\) week interaction.
# create a random intercept and random slope model
BPRSL_ref2 <- lmer(bprs ~ week * treatment + (week | subject), data = BPRSL, REML = FALSE)
# print a summary of the model
summary(BPRSL_ref2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0512 -0.6271 -0.0768 0.5288 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.9964 8.0620
## week 0.9687 0.9842 -0.51
## Residual 96.4707 9.8220
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.2521 21.262
## week -2.6283 0.3589 -7.323
## treatment2 -2.2911 1.9090 -1.200
## week:treatment2 0.7158 0.4010 1.785
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.650
## treatment2 -0.424 0.469
## wek:trtmnt2 0.356 -0.559 -0.840
# perform an ANOVA test on the two models
anova(BPRSL_ref2, BPRSL_ref1)
## Data: BPRSL
## Models:
## BPRSL_ref1: bprs ~ week + treatment + (week | subject)
## BPRSL_ref2: bprs ~ week * treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRSL_ref1 7 2745.4 2772.6 -1365.7 2731.4
## BPRSL_ref2 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# draw the plot of BPRSL
ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs))) +
scale_x_continuous(name = "Week", breaks = seq(0, 8, 1))
# Create a vector of the fitted values
Fitted <- fitted(BPRSL_ref2)
# Create a new column fitted to BPRSL
BPRSL <- BPRSL %>%
mutate(Fitted)
# draw the plot of BPRSL
ggplot(BPRSL, aes(x = week, y = Fitted, linetype = subject)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs))) +
scale_x_continuous(name = "Week", breaks = seq(0, 8, 1))
By adding the interaction term the sign of treament changes to negative. the coefficient of the interaction term is positive, but pretty unsignificant. The similar effects happen in the correlation of fixed effects.
We fit the model and observe that the the fit is pretty good.